In this section, we delve into the application of ARIMAX, SARIMAX, and VAR modeling techniques across a range of time series datasets pertinent to our project’s scope. ARIMAX and SARIMAX models come into play when we observe a unidirectional influence from one variable onto another. Conversely, VAR models are more suited to scenarios where a set of variables exhibits mutual influences.
Initiating this phase of the project involved a comprehensive literature review to identify the models previously employed to analyze COVID vaccination-related variables. The study by Yundari (2022) leverages the Vector Autoregressive (VAR) model to simultaneously analyze the impact of vaccination numbers (first dose) on new and recovered COVID-19 cases, utilizing daily data from January 13 to December 30, 2021, in West Kalimantan Province. Ye (2021) underscores the critical nature of escalating COVID-19 Daily Vaccination Numbers to curtail the pandemic, noting that achieving vaccination targets remains an uphill task. This study probes into how political partisanship might influence Daily Vaccination Numbers, drawing comparisons between Democratic and Republican counties. Asch (2024) delves into the correlation between political leanings and the reporting of vaccine adverse events (AEs), revealing that a 10% uptick in Republican votes at the state level correlates with a 5% increase in the likelihood of reporting a COVID-19 vaccine AE.
References
Yundari, Y., & Huda, N. M. (2022). Analysis of the Impact of Vaccination on Daily New and Recovered COVID-19 Cases Using the Vector Autoregressive (VAR) Model: A Case Study of West Kalimantan. BAREKENG: Jurnal Ilmu Matematika dan Terapan. https://ojs3.unpatti.ac.id/index.php/barekeng/article/view/5266
Ye, X. (2021). Exploring the Relationship Between Political Partisanship and COVID-19 Vaccination Rate. https://academic.oup.com/jpubhealth/article/45/1/91/6409075
Asch, D.A., Luo, C., & Chen, Y. (2024). Reports of COVID-19 Vaccine Adverse Events in Predominantly Republican vs Democratic States. JAMA Network Open, 7(3), e244177. https://doi.org/10.1001/jamanetworkopen.2024.4177
2. Define Models
Armed with enhanced insights into the relevant variables, this section is dedicated to defining the specific ARIMAX/SARIMAX/VAR models to be utilized for analyzing changes in Daily Vaccination Numbers. Each model is tailored to examine different facets of the interplay between COVID-19 vaccination efforts and socio-economic as well as political dynamics.
2.1 Confirmed COVID-19 Cases as a Function of Daily Vaccination Number:
(ARIMAX) Confirmed COVID-19 Cases ~ Daily Vaccination Number
This model aims to capture the impact of Daily Vaccination Numbers on the trend of newly confirmed COVID-19 cases.
Code
# Read the dataset
vac_df <- read_csv("Datasets/us_state_vaccinations.csv")

# Select relevant columns
cols_show <- c('date', 'location', 'daily_vaccinations_per_million')
t <- vac_df[, cols_show]

# Group by 'date' and summarize, ignoring NA values
t1 <- t %>%
  group_by(date) %>%
  summarize(daily_vaccinations_per_million = sum(daily_vaccinations_per_million, na.rm = TRUE))

# Convert date column to Date format
t1$date <- as.Date(t1$date)

# Newly confirmed cases
wide_data <- read_csv("Datasets/covid_confirmed_usafacts.csv")

# Define the key and value columns for pivoting
key_cols <- c("countyFIPS", "County Name", "State", "StateFIPS")
value_cols <- setdiff(names(wide_data), key_cols)

# Pivot the data from wide to long
long_data <- pivot_longer(wide_data, cols = value_cols,
                          names_to = "date", values_to = "value")

# Group by 'date' and sum confirmed cases across counties
con_case_df <- long_data %>%
  group_by(date) %>%
  summarize(value_sum = sum(value, na.rm = TRUE))

# Convert date column to Date format
con_case_df$date <- as.Date(con_case_df$date)

# Merge them together
cava_df <- merge(t1, con_case_df, by = "date", all = FALSE)
head(cava_df)
# Shape of the df
cat('The shape of this dataframe is', dim(cava_df))
The shape of this dataframe is 872 3
2.2 COVID-19 Death Cases as a Function of Daily Vaccination Number:
(ARIMAX) COVID-19 Death Cases ~ Daily Vaccination Number
Through this model, we explore how variations in Daily Vaccination Numbers influence the count of COVID-19-related fatalities.
Code
# Read the dataset
vac_df <- read_csv("Datasets/us_state_vaccinations.csv")

# Select relevant columns
cols_show <- c('date', 'location', 'daily_vaccinations_per_million')
t <- vac_df[, cols_show]

# Group by 'date' and summarize, ignoring NA values
t1 <- t %>%
  group_by(date) %>%
  summarize(daily_vaccinations_per_million = sum(daily_vaccinations_per_million, na.rm = TRUE))

# Convert date column to Date format
t1$date <- as.Date(t1$date)

# Death cases
wide_data <- read_csv("Datasets/covid_deaths_usafacts.csv")

# Define the key and value columns for pivoting
key_cols <- c("countyFIPS", "County Name", "State", "StateFIPS")
value_cols <- setdiff(names(wide_data), key_cols)

# Pivot the data from wide to long
long_data <- pivot_longer(wide_data, cols = value_cols,
                          names_to = "date", values_to = "value")

# Group by 'date' and sum death cases across counties
dead_case_df <- long_data %>%
  group_by(date) %>%
  summarize(value_sum = sum(value, na.rm = TRUE))

# Convert date column to Date format
dead_case_df$date <- as.Date(dead_case_df$date)

# Merge them together
deva_df <- merge(t1, dead_case_df, by = "date", all = FALSE)
head(deva_df)
# Shape of the df
cat('The shape of this dataframe is', dim(deva_df))
The shape of this dataframe is 872 3
2.3 Unemployment Rate as Influenced by Daily Vaccination Number and Confirmed Cases:
(ARIMAX) Unemployment Rate ~ Daily Vaccination Number + Confirmed COVID-19 Cases
This model investigates the effect of Daily Vaccination Numbers and COVID-19 case counts on the unemployment rate, providing insights into the pandemic’s broader economic implications.
Code
# Read the dataset
vac_df <- read_csv("Datasets/us_state_vaccinations.csv")

# Select relevant columns
cols_show <- c('date', 'location', 'daily_vaccinations_per_million')
t <- vac_df[, cols_show]

# Group by 'date' and summarize, ignoring NA values
t1 <- t %>%
  group_by(date) %>%
  summarize(daily_vaccinations_per_million = sum(daily_vaccinations_per_million, na.rm = TRUE))

# Convert date column to Date format
t1$date <- as.Date(t1$date)

# Aggregate to monthly level using the mean of each column
t1 <- t1 %>%
  group_by(date = format(date, "%Y-%m")) %>%
  summarize(daily_vaccinations_per_million = mean(daily_vaccinations_per_million, na.rm = TRUE))

# Convert the "YYYY-MM" strings back to Dates (first day of each month)
t1$date <- as.Date(paste0(t1$date, "-01"))

# Newly confirmed cases
wide_data <- read_csv("Datasets/covid_confirmed_usafacts.csv")

# Define the key and value columns for pivoting
key_cols <- c("countyFIPS", "County Name", "State", "StateFIPS")
value_cols <- setdiff(names(wide_data), key_cols)

# Pivot the data from wide to long
long_data <- pivot_longer(wide_data, cols = value_cols,
                          names_to = "date", values_to = "value")

# Group by 'date' and sum confirmed cases across counties
con_case_df <- long_data %>%
  group_by(date) %>%
  summarize(value_sum = sum(value, na.rm = TRUE))

# Convert date column to Date format
con_case_df$date <- as.Date(con_case_df$date)

# Aggregate to monthly level using the mean of each column
con_case_df <- con_case_df %>%
  group_by(date = format(date, "%Y-%m")) %>%
  summarize(value_sum = mean(value_sum, na.rm = TRUE))

# Convert the "YYYY-MM" strings back to Dates (first day of each month)
con_case_df$date <- as.Date(paste0(con_case_df$date, "-01"))

# Unemployment data
unemp <- read_csv('Datasets/unemployment.csv')
key_cols <- c("Location")
value_cols <- setdiff(names(unemp), key_cols)
unemp1 <- pivot_longer(unemp, cols = value_cols,
                       names_to = "Time", values_to = "Unemployment")

# Convert Time column to Date format
unemp1$date <- as.Date(paste0(unemp1$Time, "-01"))

# Convert Unemployment column to numeric (floating-point) format
unemp1$Unemployment <- as.numeric(unemp1$Unemployment)

# Focus on US
unemp2 <- unemp1[unemp1$Location == 'United States', ]
unemp3 <- unemp2[, c('date', 'Unemployment')]

# Merge them together
dd_df  <- merge(t1, con_case_df, by = "date", all = FALSE)
ddd_df <- merge(dd_df, unemp3, by = "date", all = FALSE)
head(ddd_df)
# Shape of the df
cat('The shape of this dataframe is', dim(ddd_df))
The shape of this dataframe is 16 4
2.4 Interdependence between Daily Vaccination Number and Independent Party Support Rate:
(VAR) Daily Vaccination Number ~ Independent Party Support Rate
This model evaluates how vaccination efforts and support for independent political parties influence each other over time.
Code
# Read the dataset
vac_df <- read_csv("Datasets/us_state_vaccinations.csv")

# Select relevant columns
cols_show <- c('date', 'location', 'daily_vaccinations_per_million')
t <- vac_df[, cols_show]

# Group by 'date' and summarize, ignoring NA values
t1 <- t %>%
  group_by(date) %>%
  summarize(daily_vaccinations_per_million = sum(daily_vaccinations_per_million, na.rm = TRUE))

# Convert date column to Date format
t1$date <- as.Date(t1$date)

# Party support data
demo <- read_excel('Datasets/party.xlsx', sheet = 'Democrat')
inde <- read_excel('Datasets/party.xlsx', sheet = 'Independent')
rep  <- read_excel('Datasets/party.xlsx', sheet = 'Republican')

# Transform each wide dataframe into a long dataframe
key_cols <- c("Attitude")
value_cols <- setdiff(names(demo), key_cols)
demo1 <- pivot_longer(demo, cols = value_cols, names_to = "date", values_to = "democrat")
inde1 <- pivot_longer(inde, cols = value_cols, names_to = "date", values_to = "independent")
rep1  <- pivot_longer(rep,  cols = value_cols, names_to = "date", values_to = "republican")

# Combine these three datasets together
combined_data <- full_join(demo1, inde1, by = c("date", "Attitude")) %>%
  full_join(rep1, by = c("date", "Attitude"))
combined_data1 <- combined_data[combined_data$Attitude == 'Favorable', ]

# Define the key and value columns for pivoting
key_cols <- c("Attitude", "date")
value_cols <- setdiff(names(combined_data1), key_cols)

# Pivot the data from wide to long
combined_data2 <- pivot_longer(combined_data1, cols = value_cols,
                               names_to = "Party", values_to = "value")

# Convert date column to Date format
combined_data2$date <- as.Date(combined_data2$date)

# Subset to each party
demo_data <- combined_data2[combined_data2$Party == 'democrat', ][, c("date", "value")]
inde_data <- combined_data2[combined_data2$Party == 'independent', ][, c("date", "value")]
rep_data  <- combined_data2[combined_data2$Party == 'republican', ][, c("date", "value")]

# Merge each party series with the vaccination series
demo_df <- merge(t1, demo_data, by = "date", all = FALSE)
inde_df <- merge(t1, inde_data, by = "date", all = FALSE)
rep_df  <- merge(t1, rep_data,  by = "date", all = FALSE)
head(inde_df)
# Shape of the df
cat('The shape of this dataframe is', dim(inde_df))
The shape of this dataframe is 117 3
3. ARIMAX/SARIMAX
3.1 Variable Selection
To effectively construct an ARIMAX/SARIMAX model, it is crucial to select and define the response variable, which will serve as the primary focus of the analysis. Additionally, identifying relevant exogenous variables—those external predictor variables that are hypothesized to influence the response variable—is essential. This process involves a careful examination of potential predictors to ensure that they are not only relevant but also appropriately measured and free from collinearity issues. The selection of these variables should be guided by both theoretical considerations and empirical evidence, ensuring that they meaningfully contribute to the dynamics of the model. By integrating these exogenous variables, the ARIMAX/SARIMAX model can capture complex interactions and provide nuanced insights into how external factors may drive changes in the response variable over time.
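As a concrete illustration of the collinearity screening described above, the sketch below (not part of the project pipeline; the variable names are hypothetical and the data simulated) checks pairwise correlations among candidate exogenous regressors before an ARIMAX fit:

```r
# Illustrative sketch: screen candidate exogenous regressors for collinearity.
# All series here are simulated stand-ins, not the project's data.
set.seed(42)
n <- 200
daily_vax <- cumsum(rnorm(n, mean = 5))           # trending vaccination-like series
cases     <- 0.8 * daily_vax + rnorm(n, sd = 10)  # strongly collinear candidate
mobility  <- rnorm(n)                             # unrelated candidate

X <- cbind(daily_vax, cases, mobility)
round(cor(X), 2)  # |r| near 1 between daily_vax and cases flags collinearity
```

A pair with |r| near 1 would be reduced to a single regressor (or combined) before fitting, so that coefficient estimates remain interpretable.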
cava_df.ts <- ts(cava_df,
                 start = decimal_date(as.Date("2020-12-20", format = "%Y-%m-%d")),
                 frequency = 365.25)
autoplot(cava_df.ts[, c(2:3)], facets = TRUE, color = "#5a3196") +
  xlab("Year") + ylab("") + theme_bw() +
  ggtitle("Daily changes in new confirmed case and vaccination number")
Code
deva_df.ts <- ts(deva_df,
                 start = decimal_date(as.Date("2020-12-20", format = "%Y-%m-%d")),
                 frequency = 365.25)
autoplot(deva_df.ts[, c(2:3)], facets = TRUE, color = "#5a3196") +
  xlab("Year") + ylab("") + theme_bw() +
  ggtitle("Daily changes in death cases and new vaccination number")
Code
ddd_df.ts <- ts(ddd_df,
                start = decimal_date(as.Date("2020-12-01", format = "%Y-%m-%d")),
                frequency = 12)
autoplot(ddd_df.ts[, c(2:4)], facets = TRUE, color = "#5a3196") +
  xlab("Year") + ylab("") + theme_bw() +
  ggtitle("Monthly changes in new vaccination number, confirmed cases, and unemployment rate")
The Standard Residual Plot exhibits some inconsistencies in terms of mean and variance, suggesting potential issues with the model’s assumptions about homoscedasticity or linearity.
The Autocorrelation Function (ACF) plot reveals minimal correlation among residuals, suggesting that the model has effectively captured most of the underlying process; residuals that resemble white noise are exactly what a well-specified model should leave behind, although white-noise residuals alone do not guarantee that every nuance of the data has been captured.
The Quantile-Quantile (Q-Q) Plot shows that the residuals are mostly aligned with a normal distribution, suggesting that the normality assumption for the errors may largely hold, which is a positive indicator for the model’s reliability.
Results from the Ljung-Box test show p-values below the 0.05 significance level, which typically suggests a lack of fit as it indicates autocorrelation in the residuals at lag intervals that could influence the model’s accuracy.
The coefficient table ($ttable) shows that the parameters ar1, ar2, ma4, and ma5 are statistically significant, highlighting their critical influence on the model’s dynamics and underscoring the importance of these terms in explaining the time series variability.
The Standard Residual Plot suggests some irregularities in both the mean and the variance of the residuals, pointing to potential issues in the homoscedasticity or distribution assumptions of the model.
The Autocorrelation Function (ACF) Plot reveals minimal correlation among residuals, suggesting that most of the systematic information in the data has been captured by the model. However, the presence of any remaining correlation, albeit slight, could indicate that the model might not be capturing all underlying patterns effectively, which is indicative of a suboptimal fit.
The Quantile-Quantile (Q-Q) Plot shows that the residuals are approximately normally distributed, with only minor deviations from normality. This aspect of the diagnostics is generally positive, indicating that the assumption of normality is reasonably satisfied.
Results from the Ljung-Box Test show p-values below the 0.05 significance level, suggesting that there is still some autocorrelation present in the residuals at lagged intervals. This result contradicts the ideal scenario where p-values would significantly exceed the threshold, confirming the absence of autocorrelation and a good fit.
Coefficient Significance Table: All the coefficients in the model are statistically significant, suggesting that each predictor contributes meaningfully to the model. However, significance of coefficients alone does not necessarily imply an overall effective model fit, as it does not account for potential issues in other diagnostic areas.
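The Ljung-Box decision rule applied throughout these diagnostics can be reproduced with base R's Box.test(); the sketch below runs it on simulated residuals (not the project's fitted models) to show how the p-value separates white-noise residuals from autocorrelated ones:

```r
# Sketch of the Ljung-Box check on simulated residuals (illustrative only).
set.seed(1)
res_white <- rnorm(100)  # ideal residuals: white noise

# AR(1)-style residuals with coefficient 0.7: leftover autocorrelation
res_auto <- as.numeric(stats::filter(rnorm(100), 0.7, method = "recursive"))

# Large p-value: no evidence of residual autocorrelation (adequate fit)
Box.test(res_white, lag = 10, type = "Ljung-Box")

# Small p-value: autocorrelation remains in the residuals (lack of fit)
Box.test(res_auto, lag = 10, type = "Ljung-Box")
```

When the test is applied to residuals of a fitted ARMA model, the fitdf argument should be set to the number of estimated ARMA coefficients so the degrees of freedom are adjusted accordingly.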
Series: ddd_df.ts[, "Unemployment"]
Regression with ARIMA(0,0,0) errors
Coefficients:
intercept daily_vaccinations_per_million value_sum
0.0732 0e+00 0e+00
s.e. 0.0033 2e-04 2e-04
sigma^2 = 1.838e-05: log likelihood = 66.19
AIC=-124.38 AICc=-120.75 BIC=-121.29
Training set error measures:
                        ME        RMSE         MAE        MPE     MAPE      MASE      ACF1
Training set -2.636216e-14 0.003864599 0.003345663 -0.4910086 7.187009 0.1351783 0.6529014
Code
checkresiduals(fit3)
Ljung-Box test
data: Residuals from Regression with ARIMA(0,0,0) errors
Q* = 9.5034, df = 3, p-value = 0.0233
Model df: 0. Total lags used: 3
Coefficients:
Estimate SE t.value p.value
xmean 0 0.001 0 1
sigma^2 estimated as 1.493512e-05 on 15 degrees of freedom
AIC = -8.023918 AICc = -8.006061 BIC = -7.927344
The Standard Residual Plot reveals some inconsistencies in the mean and variance, suggesting potential deviations from homoscedasticity or the presence of outliers that could affect the robustness of the model predictions.
The Autocorrelation Function (ACF) Plot exhibits minimal correlation among the residuals, which suggests that the model has effectively captured the underlying patterns in the data, leaving behind what appears to be white noise. This is a strong indicator of a well-fitting model.
The Quantile-Quantile (Q-Q) Plot closely aligns with the theoretical normal distribution, underscoring the assumption of normality in the model’s residuals, which is essential for the validity of many inferential statistics.
The Ljung-Box Test results are mixed, with half of the p-values falling below the 0.05 significance threshold. This inconsistency suggests that some autocorrelations at different lags are significantly different from zero, indicating potential issues with the model fit.
Significance Table: The statistical significance of all model coefficients is affirmed, implying that each predictor contributes meaningfully to the model. This significance is crucial for understanding the impact of each variable within the model and for making informed predictions or decisions based on the model’s outputs.
Call:
lm(formula = value_sum ~ daily_vaccinations_per_million, data = cava_df.ts)
Residuals:
Min 1Q Median 3Q Max
-67783470 -11303790 6076610 15172166 31376996
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.563e+07 1.032e+06 83.01 <2e-16 ***
daily_vaccinations_per_million -1.471e+02 5.458e+00 -26.95 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 20530000 on 870 degrees of freedom
Multiple R-squared: 0.4549, Adjusted R-squared: 0.4543
F-statistic: 726 on 1 and 870 DF, p-value: < 2.2e-16
The variables appear to be significant. Next, we examine the ACF/PACF plots of the residuals.
Code
res.fit1 <- ts(residuals(fit.reg),
               start = decimal_date(as.Date("2020-12-20", format = "%Y-%m-%d")),
               frequency = 365.25)
plot1 <- ggAcf(res.fit1, 20) + theme_bw() +
  theme(plot.background = element_rect(fill = "#D9E3F1", color = NA),
        panel.background = element_rect(fill = "#D9E3F1", color = NA))
plot2 <- ggPacf(res.fit1, 20) + theme_bw() +
  theme(plot.background = element_rect(fill = "#D9E3F1", color = NA),
        panel.background = element_rect(fill = "#D9E3F1", color = NA))
grid.arrange(plot1, plot2, nrow = 2)
There appears to be seasonality still, so I will apply some differencing.
Code
plot1 <- ggAcf(diff(res.fit1), 20) + theme_bw() +
  theme(plot.background = element_rect(fill = "#D9E3F1", color = NA),
        panel.background = element_rect(fill = "#D9E3F1", color = NA))
plot2 <- ggPacf(diff(res.fit1), 20) + theme_bw() +
  theme(plot.background = element_rect(fill = "#D9E3F1", color = NA),
        panel.background = element_rect(fill = "#D9E3F1", color = NA))
grid.arrange(plot1, plot2, nrow = 2)
According to the plots, the candidate orders are p = 1, 2, 3 and q = 1, 2, 3.
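To narrow these candidates down, one option (a sketch of the idea, not the selection procedure actually run below) is a small AIC grid search over the candidate orders; here y is a simulated series standing in for the regression residuals res.fit1:

```r
# Sketch: compare candidate ARIMA(p,1,q) orders by AIC on simulated data.
set.seed(7)
y <- cumsum(arima.sim(model = list(ar = 0.5, ma = 0.4), n = 300))  # integrated ARMA series

# Evaluate every candidate order; fits that fail to converge get AIC = Inf
grid <- expand.grid(p = 1:3, q = 1:3)
grid$AIC <- apply(grid[, c("p", "q")], 1, function(g) {
  tryCatch(AIC(arima(y, order = c(g["p"], 1, g["q"]), method = "ML")),
           error = function(e) Inf)
})
grid[which.min(grid$AIC), ]  # lowest-AIC (p, q) pair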
The Standard Residual Plot exhibits some inconsistencies in terms of mean and variance, suggesting potential issues with the model’s assumptions about homoscedasticity or linearity.
The Autocorrelation Function (ACF) Plot exhibits minimal correlation among the residuals, which suggests that the model has effectively captured the underlying patterns in the data, leaving behind what appears to be white noise. This is a strong indicator of a well-fitting model.
The Quantile-Quantile (Q-Q) Plot shows that the residuals are mostly aligned with a normal distribution, suggesting that the normality assumption for the errors may largely hold, which is a positive indicator for the model’s reliability.
Results from the Ljung-Box test show p-values below the 0.05 significance level, which typically suggests a lack of fit as it indicates autocorrelation in the residuals at lag intervals that could influence the model’s accuracy.
Significance Table: The statistical significance of all model coefficients is affirmed, implying that each predictor contributes meaningfully to the model. This significance is crucial for understanding the impact of each variable within the model and for making informed predictions or decisions based on the model’s outputs.
Call:
lm(formula = value_sum ~ daily_vaccinations_per_million, data = deva_df.ts)
Residuals:
Min 1Q Median 3Q Max
-673660 -81409 69407 113393 222586
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.980e+05 8.768e+03 113.82 <2e-16 ***
daily_vaccinations_per_million -1.120e+00 4.639e-02 -24.15 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 174500 on 870 degrees of freedom
Multiple R-squared: 0.4014, Adjusted R-squared: 0.4007
F-statistic: 583.4 on 1 and 870 DF, p-value: < 2.2e-16
The variables appear to be significant. Next, we examine the ACF/PACF plots of the residuals.
Code
res.fit1 <- ts(residuals(fit.reg),
               start = decimal_date(as.Date("2020-12-20", format = "%Y-%m-%d")),
               frequency = 365.25)
plot1 <- ggAcf(res.fit1, 20) + theme_bw() +
  theme(plot.background = element_rect(fill = "#D9E3F1", color = NA),
        panel.background = element_rect(fill = "#D9E3F1", color = NA))
plot2 <- ggPacf(res.fit1, 20) + theme_bw() +
  theme(plot.background = element_rect(fill = "#D9E3F1", color = NA),
        panel.background = element_rect(fill = "#D9E3F1", color = NA))
grid.arrange(plot1, plot2, nrow = 2)
There appears to be seasonality still, so I will apply some differencing.
Code
plot1 <- ggAcf(diff(res.fit1), 20) + theme_bw() +
  theme(plot.background = element_rect(fill = "#D9E3F1", color = NA),
        panel.background = element_rect(fill = "#D9E3F1", color = NA))
plot2 <- ggPacf(diff(res.fit1), 20) + theme_bw() +
  theme(plot.background = element_rect(fill = "#D9E3F1", color = NA),
        panel.background = element_rect(fill = "#D9E3F1", color = NA))
grid.arrange(plot1, plot2, nrow = 2)
According to the plots, the candidate orders are p = 1, 2, 3 and q = 1, 2, 3.
The Standard Residual Plot exhibits some inconsistencies in terms of mean and variance, suggesting potential issues with the model’s assumptions about homoscedasticity or linearity.
The Autocorrelation Function (ACF) Plot exhibits minimal correlation among the residuals, which suggests that the model has effectively captured the underlying patterns in the data, leaving behind what appears to be white noise. This is a strong indicator of a well-fitting model.
The Quantile-Quantile (Q-Q) Plot shows that the residuals are mostly aligned with a normal distribution, suggesting that the normality assumption for the errors may largely hold, which is a positive indicator for the model’s reliability.
Results from the Ljung-Box test show p-values below the 0.05 significance level, which typically suggests a lack of fit as it indicates autocorrelation in the residuals at lag intervals that could influence the model’s accuracy.
Significance Table: The statistical significance of all model coefficients is affirmed, implying that each predictor contributes meaningfully to the model. This significance is crucial for understanding the impact of each variable within the model and for making informed predictions or decisions based on the model’s outputs.
Call:
lm(formula = Unemployment ~ daily_vaccinations_per_million +
value_sum, data = ddd_df.ts)
Residuals:
Min 1Q Median 3Q Max
-0.008137 -0.002539 0.001572 0.002869 0.004882
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.323e-02 3.660e-03 20.008 3.78e-11 ***
daily_vaccinations_per_million 1.333e-09 8.421e-09 0.158 0.877
value_sum -5.306e-10 6.405e-11 -8.283 1.52e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.004287 on 13 degrees of freedom
Multiple R-squared: 0.8501, Adjusted R-squared: 0.827
F-statistic: 36.86 on 2 and 13 DF, p-value: 4.396e-06
The confirmed COVID-19 case number appears to be significant. Next, we examine the ACF/PACF plots of the residuals.
Code
res.fit1 <- ts(residuals(fit.reg),
               start = decimal_date(as.Date("2020-12-01", format = "%Y-%m-%d")),
               frequency = 12)
plot1 <- ggAcf(res.fit1) + theme_bw() +
  theme(plot.background = element_rect(fill = "#D9E3F1", color = NA),
        panel.background = element_rect(fill = "#D9E3F1", color = NA))
plot2 <- ggPacf(res.fit1) + theme_bw() +
  theme(plot.background = element_rect(fill = "#D9E3F1", color = NA),
        panel.background = element_rect(fill = "#D9E3F1", color = NA))
grid.arrange(plot1, plot2, nrow = 2)
According to the plots, the candidate orders are p = 1 and q = 1.
Coefficients:
Estimate SE t.value p.value
ar1 0.6799 0.1747 3.8916 0.0016
xmean 0.0007 0.0020 0.3457 0.7347
sigma^2 estimated as 7.862675e-06 on 14 degrees of freedom
AIC = -8.501737 AICc = -8.444045 BIC = -8.356877
The Standard Residual Plot exhibits some inconsistencies in terms of mean and variance, suggesting potential issues with the model’s assumptions about homoscedasticity or linearity.
The Autocorrelation Function (ACF) Plot exhibits minimal correlation among the residuals, which suggests that the model has effectively captured the underlying patterns in the data, leaving behind what appears to be white noise. This is a strong indicator of a well-fitting model.
The Quantile-Quantile (Q-Q) Plot shows that the residuals are mostly aligned with a normal distribution, suggesting that the normality assumption for the errors may largely hold, which is a positive indicator for the model’s reliability.
Results from the Ljung-Box test show p-values above the 0.05 significance level, which suggests a good fit: there is no significant autocorrelation remaining in the residuals at the tested lags.
Significance Table: The statistical significance of all model coefficients is affirmed, implying that each predictor contributes meaningfully to the model. This significance is crucial for understanding the impact of each variable within the model and for making informed predictions or decisions based on the model’s outputs.
3.4 Cross Validation
To confirm our model selection, we use cross validation and compare RMSE values between the auto.arima choice and the manually selected model.
Auto.arima gave me ARIMA(3,1,5), while the manual model selection resulted in ARIMA(3,1,3).
Code
fit.reg <- lm(value_sum ~ daily_vaccinations_per_million, data = cava_df.ts)
res.fit1 <- ts(residuals(fit.reg),
               start = decimal_date(as.Date("2020-12-20", format = "%Y-%m-%d")),
               frequency = 365.25)
n <- length(res.fit1)  # 872
k <- 220
err1 <- c()  # absolute errors, ARIMA(3,1,5)
err2 <- c()  # absolute errors, ARIMA(3,1,3)
err3 <- c()  # squared errors, ARIMA(3,1,5)
err4 <- c()  # squared errors, ARIMA(3,1,3)
for (i in 1:(n - k)) {
  xtrain <- res.fit1[1:(k - 1) + i]  # rolling training window of k - 1 observations
  xtest  <- res.fit1[k + i]          # next observation as the one-step test point
  fit1 <- Arima(xtrain, order = c(3, 1, 5), include.drift = FALSE, method = "ML")
  fcast1 <- forecast(fit1, h = 1)
  fit2 <- Arima(xtrain, order = c(3, 1, 3), include.drift = FALSE, method = "ML")
  fcast2 <- forecast(fit2, h = 1)
  # Absolute error for each iteration
  err1 <- c(err1, abs(fcast1$mean - xtest))
  err2 <- c(err2, abs(fcast2$mean - xtest))
  # Squared error for each iteration
  err3 <- c(err3, (fcast1$mean - xtest)^2)
  err4 <- c(err4, (fcast2$mean - xtest)^2)
}
RMSE1 <- sqrt(mean(err3, na.rm = TRUE))
RMSE2 <- sqrt(mean(err4, na.rm = TRUE))
cat('The RMSE of Auto Arima is', RMSE1, '\n')
The RMSE of Auto Arima is 951.6204
Code
cat('The RMSE of Manually Selection is', RMSE2, '\n')
The RMSE of Manually Selection is 1930.114
The Auto Arima model had lower RMSE compared to the manual model. Therefore, the best model is ARIMA(3,1,5).
Auto.arima gave me ARIMA(5,2,2), while the manual model selection resulted in ARIMA(3,1,3).
Code
fit.reg <- lm(value_sum ~ daily_vaccinations_per_million, data = deva_df.ts)
res.fit1 <- ts(residuals(fit.reg),
               start = decimal_date(as.Date("2020-12-20", format = "%Y-%m-%d")),
               frequency = 365.25)
n <- length(res.fit1)  # 872
k <- 220
err1 <- c()  # absolute errors, ARIMA(5,2,2)
err2 <- c()  # absolute errors, ARIMA(3,1,3)
err3 <- c()  # squared errors, ARIMA(5,2,2)
err4 <- c()  # squared errors, ARIMA(3,1,3)
for (i in 1:(n - k)) {
  xtrain <- res.fit1[1:(k - 1) + i]  # rolling training window of k - 1 observations
  xtest  <- res.fit1[k + i]          # next observation as the one-step test point
  fit1 <- Arima(xtrain, order = c(5, 2, 2), include.drift = FALSE, method = "ML")
  fcast1 <- forecast(fit1, h = 1)
  fit2 <- Arima(xtrain, order = c(3, 1, 3), include.drift = FALSE, method = "ML")
  fcast2 <- forecast(fit2, h = 1)
  # Absolute error for each iteration
  err1 <- c(err1, abs(fcast1$mean - xtest))
  err2 <- c(err2, abs(fcast2$mean - xtest))
  # Squared error for each iteration
  err3 <- c(err3, (fcast1$mean - xtest)^2)
  err4 <- c(err4, (fcast2$mean - xtest)^2)
}
RMSE1 <- sqrt(mean(err3, na.rm = TRUE))
RMSE2 <- sqrt(mean(err4, na.rm = TRUE))
cat('The RMSE of Auto Arima is', RMSE1, '\n')
The RMSE of Auto Arima is 57.65793
Code
cat('The RMSE of Manually Selection is', RMSE2, '\n')
The RMSE of Manually Selection is 59.52566
The manual model had lower RMSE compared to the Auto Arima model. Therefore, the best model is ARIMA(3,1,3).
Auto.arima gave me ARIMA(0,0,0), while the manual model selection resulted in ARIMA(1,0,0). Since this time series contains only 16 observations, we choose ARIMA(1,0,0): ARIMA(0,0,0) reduces to a white-noise mean model and is more prone to bias here.
3.5 Model Fitting
In this section, we fit the selected best models and generate forecasts.
Series: ddd_df.ts[, "Unemployment"]
Regression with ARIMA(1,0,0) errors
Coefficients:
ar1 intercept daily_vaccinations_per_million value_sum
0.897 0.066 0 0
s.e. NaN NaN NaN NaN
sigma^2 = 8.212e-06: log likelihood = 72.46
AIC=-134.92 AICc=-128.92 BIC=-131.06
Training set error measures:
                         ME        RMSE         MAE      MPE     MAPE       MASE        ACF1
Training set -0.0005780741 0.002481684 0.002158275 -1.36368 4.464085 0.08720304 -0.08143333
fit1 <- Arima(cava_df.ts[, "value_sum"], order = c(3, 1, 5),
              xreg = cava_df.ts[, "daily_vaccinations_per_million"])
forecast(fit1, h = 100, xreg = cava_df.ts[, "daily_vaccinations_per_million"]) %>%
  autoplot() + xlab("Date") + theme_bw() +
  theme(plot.background = element_rect(fill = "#D9E3F1", color = NA),
        panel.background = element_rect(fill = "#D9E3F1", color = NA))
In summary, this forecast suggests that, according to the ARIMA(3,1,5) model used, the daily confirmed cases are expected to remain at a consistent level in the near term, with the uncertainty of the forecast increasing as it projects further out into the future.
Code
fit1 <- Arima(deva_df.ts[, "value_sum"], order = c(3, 1, 3),
              xreg = deva_df.ts[, "daily_vaccinations_per_million"])
forecast(fit1, h = 100, xreg = deva_df.ts[, "daily_vaccinations_per_million"]) %>%
  autoplot() + xlab("Date") + theme_bw() +
  theme(plot.background = element_rect(fill = "#D9E3F1", color = NA),
        panel.background = element_rect(fill = "#D9E3F1", color = NA))
In summary, this forecast suggests that, according to the ARIMA(3,1,3) model used, the daily death counts are expected to remain at a consistent level in the near term, with the uncertainty of the forecast increasing as it projects further out into the future.
Code
fit1 <- Arima(ddd_df.ts[, "Unemployment"], order = c(1, 0, 0),
              xreg = ddd_df.ts[, 2:3])
forecast(fit1, h = 4, xreg = ddd_df.ts[, 2:3]) %>%
  autoplot() + xlab("Month") + theme_bw() +
  theme(plot.background = element_rect(fill = "#D9E3F1", color = NA),
        panel.background = element_rect(fill = "#D9E3F1", color = NA))
In summary, the model seems to predict that the decreasing trend in unemployment observed in the actual data will not persist in the same way, and that unemployment may stabilize at the levels seen at the end of the observed data.
4. VAR
4.1 Variable Selection
When crafting a VAR model, it is pivotal to delineate each variable that will be considered within the system. Unlike ARIMAX/SARIMAX models, which focus on a primary response variable influenced by exogenous predictors, VAR models treat all included variables as endogenously interrelated. Each variable in a VAR model is explained by its own lagged values as well as the lagged values of all other variables in the system, allowing for a multidirectional analysis of influence and response over time.
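Concretely, for the bivariate case used here, a VAR(1) writes each series as a function of one lag of both series (the coefficients a_ij, intercepts c_i, and error terms e_it below are generic placeholders, not estimates):

```
vaccination_t = c_1 + a_11 * vaccination_{t-1} + a_12 * support_{t-1} + e_1t
support_t     = c_2 + a_21 * vaccination_{t-1} + a_22 * support_{t-1} + e_2t
```

A VAR(p) extends this with p lagged terms of every variable in each equation, which is what makes the analysis of influence multidirectional.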
The selection process for variables in a VAR model is underpinned by both theoretical frameworks and empirical data, necessitating a rigorous review to ascertain their interconnectedness and the absence of unit roots that may indicate non-stationarity. This ensures that each variable not only possesses inherent relevance but also contributes to a comprehensive understanding of the system’s internal dynamics. By employing a VAR model, one can dissect and comprehend the nuanced interplay between variables, enabling a holistic view of how each variable evolves in relation to others within the dataset.
inde_df.ts <- ts(inde_df,
                 start = decimal_date(as.Date("2020-12-22", format = "%Y-%m-%d")),
                 frequency = 52)
autoplot(inde_df.ts[, c(2:3)], facets = TRUE, color = "#5a3196") +
  xlab("Year") + ylab("") + theme_bw() +
  ggtitle("Daily changes in new vaccination number and independent party support rate")
Code
demo_df.ts <- ts(demo_df,
                 start = decimal_date(as.Date("2020-12-22", format = "%Y-%m-%d")),
                 frequency = 52)
autoplot(demo_df.ts[, c(2:3)], facets = TRUE, color = "#5a3196") +
  xlab("Year") + ylab("") + theme_bw() +
  ggtitle("Daily changes in new vaccination number and democratic party support rate")
Code
rep_df.ts <- ts(rep_df,
                start = decimal_date(as.Date("2020-12-22", format = "%Y-%m-%d")),
                frequency = 52)
autoplot(rep_df.ts[, c(2:3)], facets = TRUE, color = "#5a3196") +
  xlab("Year") + ylab("") + theme_bw() +
  ggtitle("Daily changes in new vaccination number and republican party support rate")
4.2 Lag Order Selection
In the process of refining our VAR model, a critical step involves selecting the optimal number of lag periods, denoted p. This selection significantly influences the model's accuracy and its ability to capture the dynamics between the time series variables. To determine the best p, we utilize criteria such as the Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC), and the Hannan-Quinn Information Criterion (HQIC), which balance model complexity against goodness of fit.
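In the `vars` package, this comparison can be carried out with `VARselect()`, which fits every candidate lag order up to a maximum and reports the order minimizing each criterion (BIC appears under the label SC, the Schwarz criterion). The call below is a sketch assuming the bivariate `ts` objects built earlier; the `lag.max` value is an illustrative choice.

```r
# Sketch: comparing candidate lag orders p by information criterion.
# VARselect() fits VAR(1)..VAR(lag.max) and reports, for each criterion,
# the lag order that minimizes it. BIC is labeled SC (Schwarz criterion).
library(vars)

sel <- VARselect(inde_df.ts[, 2:3], lag.max = 8, type = "const")
sel$selection   # named vector: AIC(n), HQ(n), SC(n), FPE(n)
sel$criteria    # criterion values for every candidate p
```

When the criteria disagree, AIC tends to favor larger p and BIC/HQIC smaller p, which mirrors the parsimony-versus-fit trade-off discussed below.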
Examining the VAR models with p = 1 and p = 3 lag orders, we notice that some variables do not contribute significantly to the model. Notably, the p = 3 model tends to have fewer significant coefficients than the p = 1 model. However, opting for p = 1 may oversimplify the model, potentially failing to capture all the relevant relationships among the variables. Thus, while p = 1 offers a more parsimonious model, it might not sufficiently account for the complexity of the dynamics within the data, and a balance must be struck between model simplicity and its ability to elucidate significant interactions.
From the perspective of the p-values associated with these models, each demonstrates a moderate number of significant variables. Although a model with p = 1 lag might appear overly simplistic and possibly inadequate for capturing the full dynamics of the dataset, a model with p = 6 lags offers a slight improvement: incorporating more lag terms allows the VAR model to better account for the temporal dependencies among the variables, providing a more accurate and insightful analysis.
Similarly, examining the VAR models with p = 1 and p = 5 lag orders, we notice that some variables do not contribute significantly to the model, and the p = 5 model tends to have fewer significant coefficients than the p = 1 model. As before, p = 1 offers parsimony but may fail to capture all the relevant relationships among the variables, so the choice again reflects a trade-off between simplicity and the ability to elucidate significant interactions.
4.3 Cross Validation
To validate our hypothesis and ensure the robustness of our model selection, we employ cross-validation to compare the Root Mean Squared Error (RMSE) of the two chosen models. By assessing the RMSE, we can quantitatively determine which model provides a more accurate fit to the data.
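One way to implement this comparison is a rolling-origin evaluation: refit each candidate VAR on an expanding window, forecast a few steps ahead, and accumulate squared errors on the held-out observations. The sketch below is illustrative, not the exact setup used here; the window sizes, forecast horizon, and the choice of the first series as the evaluation target are all assumptions.

```r
# Sketch: rolling-origin cross-validation of two candidate VAR lag orders.
# Refits on an expanding window, forecasts h steps ahead, and computes
# RMSE on the first series (here, daily vaccinations per million).
library(vars)

rolling_rmse <- function(y, p, h = 4, n_test = 20) {
  n <- nrow(y)
  sq_err <- c()
  for (i in seq(n - n_test, n - h)) {
    fit <- VAR(y[1:i, ], p = p, type = "const")
    fc  <- predict(fit, n.ahead = h)$fcst[[1]][, "fcst"]  # point forecasts
    sq_err <- c(sq_err, (fc - y[(i + 1):(i + h), 1])^2)
  }
  sqrt(mean(sq_err))
}

c(VAR1 = rolling_rmse(inde_df.ts[, 2:3], p = 1),
  VAR6 = rolling_rmse(inde_df.ts[, 2:3], p = 6))
```

The lag order with the lower out-of-sample RMSE is preferred, complementing the in-sample information criteria from the previous subsection.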
Code
var1 <- VAR(inde_df.ts, p = 1, type = "const")
forecast(var1) %>%
  autoplot() +
  xlab("Week") +
  theme_bw() +
  theme(plot.background = element_rect(fill = "#D9E3F1", color = NA),
        panel.background = element_rect(fill = "#D9E3F1", color = NA))
The forecast results indicate a projected decline in daily vaccinations per million in the upcoming period, which aligns with our current understanding of the trends. Additionally, the support rate for the independent party is expected to decrease overall. These trends suggest significant shifts in public health dynamics and political landscapes, which warrant close monitoring and analysis to better understand the underlying factors driving these changes.
The best model for this relationship is VAR(6).
Code
var1 <- VAR(demo_df.ts, p = 6, type = "const")
forecast(var1) %>%
  autoplot() +
  xlab("Week") +
  theme_bw() +
  theme(plot.background = element_rect(fill = "#D9E3F1", color = NA),
        panel.background = element_rect(fill = "#D9E3F1", color = NA))
The forecast results indicate a projected decline in daily vaccinations per million in the upcoming period, which aligns with our current understanding of the trends. Additionally, the support rate for the democratic party is expected to be stable with some fluctuations overall. These trends suggest significant shifts in public health dynamics and political landscapes, which warrant close monitoring and analysis to better understand the underlying factors driving these changes.
The best model for this relationship is VAR(5).
Code
var1 <- VAR(rep_df.ts, p = 5, type = "const")
forecast(var1) %>%
  autoplot() +
  xlab("Week") +
  theme_bw() +
  theme(plot.background = element_rect(fill = "#D9E3F1", color = NA),
        panel.background = element_rect(fill = "#D9E3F1", color = NA))
The forecast results indicate a projected decline in daily vaccinations per million in the upcoming period, which aligns with our current understanding of the trends. Additionally, the support rate for the republican party is expected to be stable with some fluctuations overall. These trends suggest significant shifts in public health dynamics and political landscapes, which warrant close monitoring and analysis to better understand the underlying factors driving these changes.